Before this course

  • Questions are always welcome. If anything comes up during the course, feel free to ask me on Facebook.

About R

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R. …… R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

R 4.5.1 was released in June 2025.


How/where to obtain and install R?

If Linux is your default OS, you can install R directly from your distribution's package repositories. If you use Windows or macOS, you can download the R binaries or the source code from CRAN.

Ubuntu users

  • Update the package indices: sudo apt update -qq
  • Install the two helper packages we need: sudo apt install --no-install-recommends software-properties-common dirmngr
  • Add the signing key: wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
  • Add the CRAN repository: sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"
  • Install R: sudo apt install --no-install-recommends r-base r-base-dev

MacOS users

  • Intel x86-64: R-4.5.1-x86_64.pkg
  • Apple silicon arm64: R-4.5.1-arm64.pkg

Windows users

  • Download the latest version of R from CRAN
  • Install R with the downloaded installer
  • Install Rtools from CRAN
    • Rtools is a collection of tools for building R packages on Windows. It includes a compiler, a set of libraries, and other tools that are needed to build R packages from source.
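  • Add Rtools to your PATH environment variable (recent Rtools releases are normally found by R automatically, so this step may only be needed for older versions)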
    • Open the Control Panel and go to System and Security > System > Advanced system settings > Environment Variables.
    • Under System variables, find the PATH variable and click Edit.
    • Add the path to the Rtools bin directory (e.g., C:\Rtools\bin) to the PATH variable.
    • Click OK to save the changes.

CRAN Repositories


Using RStudio as your default R-programming IDE

About RStudio

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

Install the most suitable version of RStudio for your needs.

  • Desktop version: Access RStudio locally.
  • Server version: Access via a web browser.

Other choices

  • VS Code
  • Vim + Nvim-R
  • Any other text editor: gedit, Emacs (+ESS), Eclipse, etc.

The most important step when beginning to learn R is using help()

help() & help.search()

help(help)
help.search("standard deviation")

? & ??

?mean
??hypergeometric
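
Beyond help() and ?, base R ships a few other lookup helpers. The following is a minimal sketch using only functions that come with R; the search terms are arbitrary examples and `mean_like` is a name chosen here for illustration.

```r
# Find objects whose names match a regular expression
mean_like <- apropos("^mean")
print(mean_like)

# Show the argument list of a function without opening its help page
print(args(sd))

# Run the examples from a help page (here: the examples for mean())
example(mean)
```

apropos() is handy when you remember only part of a function name, while example() lets you see a function in action before reading the full documentation.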

Package installation & PATH setting

Installing packages in R console

# Download Pkgs from CRAN repository & install
install.packages('rmarkdown',                         # Package name
                 repos="http://cran.csie.ntu.edu.tw", # The URL of the CRAN repository
                 destdir="~/Download",                # The directory where downloaded pkgs are stored
                 lib=.libPaths()[1])                  # The directory in which to install pkgs

# Install Pkgs from downloaded source code
install.packages('~/Download/rmarkdown_0.5.1.tar.gz',
                 repos=NULL,
                 type="source",
                 lib=.libPaths()[1])

Installing packages in terminal

$ R CMD INSTALL -l $HOME/R/4.1 rmarkdown_0.5.1.tar.gz

Setting PATH

.libPaths(new)  # .libPaths("/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library")

Some packages must be downloaded and installed from R-Forge

Use install.packages("Pkg", repos="http://R-Forge.R-project.org")

Using the package installed

library(Pkg)
require(Pkg) # Avoid using this!

What is the difference between require() and library()?
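
One way to see the difference is to try loading a package that does not exist. In the sketch below, `notARealPkg` is a made-up placeholder name: library() stops with an error when the package is missing, whereas require() only warns and returns FALSE, so a script may carry on and fail later in a less obvious place.

```r
# 'notARealPkg' is a hypothetical name used only to trigger the failure case
res_require <- suppressWarnings(require("notARealPkg", character.only = TRUE))
print(res_require)  # FALSE: require() reports failure through its return value

res_library <- tryCatch(
    library("notARealPkg", character.only = TRUE),
    error = function(e) "error"
)
print(res_library)  # "error": library() signals an error instead
```

This is why library() is usually the safer choice in scripts: a missing package is reported immediately at the point of loading.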


Bioconductor

About Bioconductor

Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development.

Install Pkgs from Bioconductor

# Install BiocManager
install.packages("BiocManager")
BiocManager::install(pkgname)

Rich course materials

Courses & conference


Basic operation

5+5
5-3
5*3
5/3
5^3
10%%3

# Variable declaration
x <- 5 # '<-' is the assignment operator in R; at the top level it behaves like '='
y <- function(i) mean(i)

Data and object types

Data types

  • numeric: c(1:3, 5 ,7)
  • character: c("1","2","3"); LETTERS[1:3]
  • logical: TRUE; FALSE
  • complex: 1+2i; complex(real=1, imaginary=3)
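
The types listed above can be checked with class(). This is a small sketch; `mixed` is a name introduced here to show what happens when types are combined in one vector.

```r
# class() reports the type of each example value
print(class(c(1:3, 5, 7)))   # "numeric" (the integers are promoted by c())
print(class(LETTERS[1:3]))   # "character"
print(class(TRUE))           # "logical"
print(class(1+2i))           # "complex"
print(class(1:3))            # "integer", a separate type for whole numbers

# Mixing types in c() coerces everything to the most general type
mixed <- c(1, "b", TRUE)
print(class(mixed))          # "character"
```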

Object types

  • vector: the data types of all elements in a vector must be consistent!
x <- 1:5
y <- c(6,7,8,9,10)
z <- x - y
print(z)
## [1] -5 -5 -5 -5 -5
# Vectorized code performs better!
a <- 1:100000
system.time(mean(a))
##    user  system elapsed 
##       0       0       0
total <- 0
system.time({for (i in a) total <- total + i; total/length(a)}) # Same mean, computed with an explicit loop
##    user  system elapsed 
##   0.002   0.000   0.002
  • matrix
x <- matrix(rnorm(100), nrow=20, ncol=5)
print(x)
##              [,1]         [,2]       [,3]        [,4]        [,5]
##  [1,]  2.55038124  0.184262481 -0.4561954  1.14895276 -1.92047276
##  [2,] -1.58578387 -0.007294847 -0.1266443  0.65476618  1.48662616
##  [3,] -0.94487570 -0.256090246  1.7646225  2.37595500  0.79648732
##  [4,] -1.25740801  1.174664052 -0.2192437 -0.57965671 -1.87931963
##  [5,]  1.21101090 -0.250846972  0.6736424  0.06751027  0.15817507
##  [6,]  0.10050912  0.724036994  0.4626070 -2.18041770 -0.47613009
##  [7,] -1.67498300 -0.108837647 -1.1641596  1.40305109 -0.57147186
##  [8,] -1.65563785 -1.109198549  0.3337019  0.45362140  0.40031438
##  [9,] -0.31053419 -2.128176652  0.1197364 -0.87086882 -0.01925781
## [10,]  0.54898811 -2.074307611  1.7757706 -0.60428193  1.39690828
## [11,] -1.58184057  0.036519672  1.1332032  1.18419250 -0.36168788
## [12,]  0.77839776  1.624411608  0.5920319  1.71761575  1.45908984
## [13,]  1.31375889  0.500091929  1.1144533 -1.75353085 -0.32319226
## [14,]  0.05202539  0.249442979 -1.3079697 -3.17371927  0.08209465
## [15,]  1.05060515  1.758710198  0.3199155 -1.45387452  0.02531780
## [16,]  1.85035452 -1.019057568  0.6645884  0.63302917 -1.43927096
## [17,]  0.26549603 -0.637796473 -1.0054882  0.88026645  0.52535503
## [18,] -0.20088955 -0.992080587 -0.3495000 -1.15589046  0.09926207
## [19,]  0.21709471 -0.802507844  1.7528800 -0.26107060 -1.14192450
## [20,] -1.45922886  0.428763413  1.2400165  0.43486224 -1.81211307
x[1,3]
x[2:4,]
x[,3:5]
x %*% t(x)

# A matrix is a vector with subscripts!
x[1:3]
x[1:3,1]
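
To make "a matrix is a vector with subscripts" concrete, here is a small sketch with a made-up 2 x 3 matrix `m`: the values are stored column by column, and the dim attribute is all that turns the vector into a matrix.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)  # filled column by column
print(m)

print(m[4])      # 4th element of the underlying vector ...
print(m[2, 2])   # ... is the same cell as row 2, column 2

v <- 1:6
dim(v) <- c(2, 3)       # adding a dim attribute turns a vector into a matrix
print(identical(v, m))  # TRUE
```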
  • array
y <- array(rnorm(64), c(8,4,2))
print(y) # An array is also a vector with subscripts!
## , , 1
## 
##             [,1]       [,2]       [,3]        [,4]
## [1,] -0.24242102  0.2933525  0.2349756  0.43253848
## [2,] -1.28912989  1.2934986  1.8828807  0.05210916
## [3,]  0.10792740  0.6808473  0.7594639  0.34553260
## [4,] -0.42420471  0.3020578  0.1506608 -0.78573316
## [5,] -1.52540753 -1.1579384 -0.2453179  0.95768947
## [6,]  1.16660007 -0.3250908 -0.4550057  0.44779420
## [7,]  0.40456494  0.4160561 -0.5125978 -0.70044813
## [8,]  0.08019048 -0.1122778 -1.7348882  0.21477218
## 
## , , 2
## 
##             [,1]        [,2]        [,3]       [,4]
## [1,]  0.56596731 -1.62993185 -0.03481932  1.4169964
## [2,]  0.19394806  0.74626495  1.62493451 -0.6921677
## [3,] -0.06247529 -1.01941115  1.30076147 -1.3304937
## [4,]  0.69862984 -1.23584830 -0.65781036  1.1487815
## [5,]  0.12240414 -0.67416104  0.70132428 -0.1913586
## [6,]  1.41164163 -0.04719757 -2.08769008 -1.5000244
## [7,] -0.50737293  1.06244430 -0.25884372 -1.4470308
## [8,] -0.22431286 -1.15206621  1.12773303 -0.9611897
  • list: the elements of a list may have different types and structures
x <- list(1:5, c("a","b","c"), matrix(rnorm(10), nrow=5, ncol=2))
print(x)
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1] "a" "b" "c"
## 
## [[3]]
##            [,1]       [,2]
## [1,]  0.1659226  0.1472247
## [2,]  0.5536150 -1.8438265
## [3,] -0.7093551  0.6275619
## [4,] -0.1035621 -0.4417174
## [5,]  1.2053090 -0.8757276
x$mylist <- x
print(x)
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1] "a" "b" "c"
## 
## [[3]]
##            [,1]       [,2]
## [1,]  0.1659226  0.1472247
## [2,]  0.5536150 -1.8438265
## [3,] -0.7093551  0.6275619
## [4,] -0.1035621 -0.4417174
## [5,]  1.2053090 -0.8757276
## 
## $mylist
## $mylist[[1]]
## [1] 1 2 3 4 5
## 
## $mylist[[2]]
## [1] "a" "b" "c"
## 
## $mylist[[3]]
##            [,1]       [,2]
## [1,]  0.1659226  0.1472247
## [2,]  0.5536150 -1.8438265
## [3,] -0.7093551  0.6275619
## [4,] -0.1035621 -0.4417174
## [5,]  1.2053090 -0.8757276
  • data frame: a data frame is a list of equal-length vectors, one per column, and each column may have its own type
df<-data.frame(num=1:10, 
           char=LETTERS[1:10], 
           logic=sample(c(TRUE,FALSE), 10, replace=TRUE))

df
##    num char logic
## 1    1    A  TRUE
## 2    2    B FALSE
## 3    3    C  TRUE
## 4    4    D FALSE
## 5    5    E FALSE
## 6    6    F FALSE
## 7    7    G  TRUE
## 8    8    H FALSE
## 9    9    I FALSE
## 10  10    J FALSE
df$char
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
df$logic[5:7]
## [1] FALSE FALSE  TRUE
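
Because a data frame is a list of columns underneath, the usual list tools work on it. A brief sketch with a small made-up data frame `df2`:

```r
df2 <- data.frame(num = 1:3, char = c("a", "b", "c"))

print(is.list(df2))        # TRUE: a data frame is a list underneath
print(length(df2))         # 2: one list element per column
print(sapply(df2, class))  # each column keeps its own type
print(df2[["num"]])        # list-style access returns the column vector
```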
  • factor: An R factor might be viewed simply as a vector with a bit more information added (though, as seen below, it’s different from this internally). That extra information consists of a record of the distinct values in that vector, called levels.
x <- c(5, 12, 32, 12)
xf <- factor(x)
print(xf)
## [1] 5  12 32 12
## Levels: 5 12 32

So…. a factor looks like a vector, right?

str(xf) # Here str stands for structure. This function shows the internal structure of any R object.
##  Factor w/ 3 levels "5","12","32": 1 2 3 2
unclass(xf)
## [1] 1 2 3 2
## attr(,"levels")
## [1] "5"  "12" "32"
length(xf)
## [1] 4

What??? What are you talking about?

x <- c(5, 12, 13, 12)
xff <- factor(x, levels=c(5, 12, 13, 88))
xff
## [1] 5  12 13 12
## Levels: 5 12 13 88
xff[2] <- 88 
xff
## [1] 5  88 13 12
## Levels: 5 12 13 88
xff[2] <- 28 # You cannot sneak in an "illegal" level
## Warning in `[<-.factor`(`*tmp*`, 2, value = 28): invalid factor level, NA
## generated
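
The behaviour above follows from how a factor is stored: integer codes plus a character vector of levels. The sketch below (the names `f`, `codes` and `values` are made up here) shows the codes and a common pitfall when converting a factor back to numbers.

```r
f <- factor(c(5, 12, 13, 12))

codes <- as.integer(f)                  # the internal codes: 1 2 3 2
values <- as.numeric(as.character(f))   # the original numbers: 5 12 13 12

print(codes)
print(values)
print(levels(f))                        # "5" "12" "13"
```

Calling as.numeric() directly on a factor returns the codes, not the original values, so going through as.character() first is the usual way to recover the numbers.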
  • table: Another common way to store information is in a table.
# One way table
a <- factor(c("A","A","B","A","B","B","C","A","C"))
a
## [1] A A B A B B C A C
## Levels: A B C
a.table <- table(a)
a.table
## a
## A B C 
## 4 3 2
attributes(a.table)
## $dim
## [1] 3
## 
## $dimnames
## $dimnames$a
## [1] "A" "B" "C"
## 
## 
## $class
## [1] "table"
# Two way table
a <- c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes","Never")
b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
twoway.table <- table(a,b)
twoway.table
##            b
## a           Maybe No Yes
##   Always        2  0   0
##   Never         0  1   1
##   Sometimes     2  1   1
# An example
sexsmoke<-matrix(c(70,120,65,140),ncol=2,byrow=TRUE)
rownames(sexsmoke)<-c("male","female")
colnames(sexsmoke)<-c("smoke","nosmoke")
sexsmoke <- as.table(sexsmoke)
sexsmoke
##        smoke nosmoke
## male      70     120
## female    65     140

Control structures

Conditional execution

  • equal: ==
  • not equal: !=
  • greater/less than: >, <
  • greater/less than or equal: >=, <=

Logical operators

  • and: & (element-wise), && (single values, short-circuit)
  • or: | (element-wise), || (single values, short-circuit)
  • not: !
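
The single and double forms of the logical operators behave differently. A minimal sketch with two made-up logical vectors `a` and `b`:

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

elementwise_and <- a & b   # compares element by element
elementwise_or  <- a | b
print(elementwise_and)     # TRUE FALSE FALSE
print(elementwise_or)      # TRUE TRUE TRUE

# && and || expect single values and stop as soon as the result is known,
# so the right-hand side is never evaluated here
short_circuit <- FALSE && stop("never reached")
print(short_circuit)       # FALSE
```

As a rule of thumb, use & and | when working with whole vectors (for example inside ifelse() or when subsetting), and && and || inside if conditions.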

if-else statements

if (cond1) {cmd1} else {cmd2}
# Example
if (1 == 0) {
    print(1)
} else {
    print(2)
}
## [1] 2

ifelse() function (a vectorized counterpart of the ternary operator)

ifelse(test, true_value, false_value)
x <- 1:10
ifelse(x<5|x>8, x, 0)
##  [1]  1  2  3  4  0  0  0  0  9 10

switch-case statements

AA <- 'foo'
switch(AA,
       foo = {print('AA is foo')},
       bar = {print('AA is bar')},
       {print('Default')}
)
## [1] "AA is foo"

Loops

For loop

for (var in vector) {
    statement
}
# Example
mydf <- iris
head(mydf)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
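myve <- numeric(nrow(mydf)) # Preallocate the result instead of growing it in the loop
for (i in seq_len(nrow(mydf))) {
    myve[i] <- mean(as.numeric(mydf[i, 1:3])) # Mean of the first three columns of row i
}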
myve
##   [1] 3.333333 3.100000 3.066667 3.066667 3.333333 3.666667 3.133333 3.300000
##   [9] 2.900000 3.166667 3.533333 3.266667 3.066667 2.800000 3.666667 3.866667
##  [17] 3.533333 3.333333 3.733333 3.466667 3.500000 3.433333 3.066667 3.366667
##  [25] 3.366667 3.200000 3.333333 3.400000 3.333333 3.166667 3.166667 3.433333
##  [33] 3.600000 3.700000 3.166667 3.133333 3.433333 3.300000 2.900000 3.333333
##  [41] 3.266667 2.700000 2.966667 3.366667 3.600000 3.066667 3.500000 3.066667
##  [49] 3.500000 3.233333 4.966667 4.700000 4.966667 3.933333 4.633333 4.333333
##  [57] 4.766667 3.533333 4.700000 3.933333 3.500000 4.366667 4.066667 4.566667
##  [65] 4.033333 4.733333 4.366667 4.200000 4.300000 4.000000 4.633333 4.300000
##  [73] 4.566667 4.533333 4.533333 4.666667 4.800000 4.900000 4.466667 3.933333
##  [81] 3.900000 3.866667 4.133333 4.600000 4.300000 4.633333 4.833333 4.333333
##  [89] 4.233333 4.000000 4.166667 4.566667 4.133333 3.533333 4.166667 4.300000
##  [97] 4.266667 4.466667 3.533333 4.200000 5.200000 4.533333 5.333333 4.933333
## [105] 5.100000 5.733333 3.966667 5.500000 5.000000 5.633333 4.933333 4.800000
## [113] 5.100000 4.400000 4.566667 4.966667 5.000000 6.066667 5.733333 4.400000
## [121] 5.266667 4.433333 5.733333 4.633333 5.233333 5.466667 4.600000 4.666667
## [129] 4.933333 5.333333 5.433333 6.033333 4.933333 4.733333 4.766667 5.600000
## [137] 5.100000 5.000000 4.600000 5.133333 5.133333 5.033333 4.533333 5.300000
## [145] 5.233333 4.966667 4.600000 4.900000 5.000000 4.666667
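
The same row means can be obtained without an explicit loop. The sketch below uses base R's rowMeans(); `loop_means` and `vec_means` are names chosen here for illustration.

```r
mydf <- iris

# Loop version, as in the example above
loop_means <- numeric(nrow(mydf))
for (i in seq_len(nrow(mydf))) {
    loop_means[i] <- mean(as.numeric(mydf[i, 1:3]))
}

# Vectorized version: one call over the first three columns
vec_means <- unname(rowMeans(mydf[, 1:3]))

print(all.equal(loop_means, vec_means))   # TRUE
print(head(vec_means, 3))
```

For row or column sums and means, rowMeans(), colMeans(), rowSums() and colSums() are usually both shorter and faster than a loop.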

while loop

while (condition) statements
# Example
z <- 0
while (z < 5) {
    z <- z + 2
    print(z)
}
## [1] 2
## [1] 4
## [1] 6

apply loop

For matrix/array
apply(X, MARGIN, FUN, ...) # MARGIN: 1 = rows, 2 = columns; ... = extra arguments passed to FUN

# Examples
apply(iris[,1:3], 1, mean)

x <- 1:10

apply(as.matrix(x), 1, function(i) {
    if (i < 5) 
        i - 1 
    else 
        i/i
})
For vector/list
lapply(X, FUN)
sapply(X, FUN)
# Examples
mylist <- as.list(iris[1:3, 1:3])
mylist
## $Sepal.Length
## [1] 5.1 4.9 4.7
## 
## $Sepal.Width
## [1] 3.5 3.0 3.2
## 
## $Petal.Length
## [1] 1.4 1.4 1.3
lapply(mylist, sum) # Compute sum of each list component and return result as list
## $Sepal.Length
## [1] 14.7
## 
## $Sepal.Width
## [1] 9.7
## 
## $Petal.Length
## [1] 4.1
sapply(mylist, sum) # Compute sum of each list component and return result as vector
## Sepal.Length  Sepal.Width Petal.Length 
##         14.7          9.7          4.1
More apply functions
  • tapply
  • mapply
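
A short sketch of the two functions just listed, using the built-in iris data; `sepal_by_species` and `sums` are names made up for this example.

```r
# tapply(): apply a function to groups of a vector defined by a factor
sepal_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
print(sepal_by_species)

# mapply(): apply a function to several vectors in parallel
sums <- mapply(function(a, b) a + b, 1:3, 4:6)
print(sums)   # 5 7 9
```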

function

FunctionName <- function(arg1, arg2, ...) { 
    statements
    return(R_object)
}
add <- function(a, b) {
    c <- a + b
    return(c)
}
x <- 5
y <- 7
z <- add(x,y)
z
## [1] 12

Advanced R programming

Garbage collection

  • rm()
  • gc()
x <- as.matrix(read.table("test.csv", sep="\t")) # x is a 4500000 x 220 matrix
y <- apply(x, 1, mean)
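rm(list=c("x","y")) # Remove the objects so that their memory can be reclaimed
gc()                # Trigger garbage collection and report memory usage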

Use data.table to speed up data access

See Introduction to the data.table package in R

Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread). Offers a natural and flexible syntax, for faster development. - from CRAN

library(data.table)
grpsize <- ceiling(1e7/26^2)
DF <- data.frame(
    x=rep(LETTERS, each=26*grpsize),
    y=rep(letters, each=grpsize),
    v=runif(grpsize*26^2),
    stringsAsFactors=FALSE)
system.time(ans1 <- DF[DF$x=="R" & DF$y=="h",])
##    user  system elapsed 
##   0.058   0.009   0.066
DT <- as.data.table(DF)
setkey(DT, x, y)
system.time(ans2 <- DT[list("R","h")])
##    user  system elapsed 
##   0.014   0.001   0.004

Tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying philosophy and common APIs.

Hadley Wickham
install.packages("tidyverse")
  • magrittr: a forward-pipe operator for R

Use this equation as an example:

\[ \log\left(\sum_{i=1}^{n}\exp(x_i)\right) \]

In base R, you would typically compute this with nested function calls:

log(sum(exp(MyData)), exp(1))

With magrittr, you can calculate the equation like this:

MyData %>% exp %>% sum %>% log(exp(1))
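
For readers without magrittr installed, the same chain can be sketched with the native pipe |> that base R provides since version 4.1. Here `MyData` is a small made-up vector, and note that the native pipe needs the parentheses after each function name.

```r
MyData <- c(1, 2, 3)   # made-up example data

nested <- log(sum(exp(MyData)))               # nested calls, read inside-out
piped  <- MyData |> exp() |> sum() |> log()   # same steps, read left to right

print(nested)
print(all.equal(nested, piped))   # TRUE
```

Both forms compute the same value; the piped version simply lists the steps in the order they are applied.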
  • plyr

“plyr is a set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together. It’s already possible to do this with split and the apply functions, but plyr just makes it all a bit easier...”

set.seed(1)
d <- data.frame(year = rep(2000:2005, each=3),
                count = round(runif(18, 0, 20)) # 18 random counts between 0 and 20
                )

print(d)
##    year count
## 1  2000     5
## 2  2000     7
## 3  2000    11
## 4  2001    18
## 5  2001     4
## 6  2001    18
## 7  2002    19
## 8  2002    13
## 9  2002    13
## 10 2003     1
## 11 2003     4
## 12 2003     4
## 13 2004    14
## 14 2004     8
## 15 2004    15
## 16 2005    10
## 17 2005    14
## 18 2005    20
library(plyr)
# For each year, compute the coefficient of variation (sd / mean) of count
ddply(d, "year", function(x) {
    mean.count <- mean(x$count)
    sd.count <- sd(x$count)
    cv <- sd.count/mean.count
    data.frame(cv.count=cv)
})
##   year  cv.count
## 1 2000 0.3984848
## 2 2001 0.6062178
## 3 2002 0.2309401
## 4 2003 0.5773503
## 5 2004 0.3069680
## 6 2005 0.3431743
  • dplyr: a package for data manipulation, written and maintained by Hadley Wickham. It provides easy-to-use functions that are very handy for exploratory data analysis and manipulation.

    • filter(): returns all rows that satisfy the given condition(s).
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Let's start with a dataset about air quality
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
# Keep the records with Temp > 70
filter(airquality, Temp > 70)
##     Ozone Solar.R Wind Temp Month Day
## 1      36     118  8.0   72     5   2
## 2      12     149 12.6   74     5   3
## 3       7      NA  6.9   74     5  11
## 4      11     320 16.6   73     5  22
## 5      45     252 14.9   81     5  29
## 6     115     223  5.7   79     5  30
## 7      37     279  7.4   76     5  31
## 8      NA     286  8.6   78     6   1
## 9      NA     287  9.7   74     6   2
## 10     NA     186  9.2   84     6   4
## 11     NA     220  8.6   85     6   5
## 12     NA     264 14.3   79     6   6
## 13     29     127  9.7   82     6   7
## 14     NA     273  6.9   87     6   8
## 15     71     291 13.8   90     6   9
## 16     39     323 11.5   87     6  10
## 17     NA     259 10.9   93     6  11
## 18     NA     250  9.2   92     6  12
## 19     23     148  8.0   82     6  13
## 20     NA     332 13.8   80     6  14
## 21     NA     322 11.5   79     6  15
## 22     21     191 14.9   77     6  16
## 23     37     284 20.7   72     6  17
## 24     12     120 11.5   73     6  19
## 25     13     137 10.3   76     6  20
## 26     NA     150  6.3   77     6  21
## 27     NA      59  1.7   76     6  22
## 28     NA      91  4.6   76     6  23
## 29     NA     250  6.3   76     6  24
## 30     NA     135  8.0   75     6  25
## 31     NA     127  8.0   78     6  26
## 32     NA      47 10.3   73     6  27
## 33     NA      98 11.5   80     6  28
## 34     NA      31 14.9   77     6  29
## 35     NA     138  8.0   83     6  30
## 36    135     269  4.1   84     7   1
## 37     49     248  9.2   85     7   2
## 38     32     236  9.2   81     7   3
## 39     NA     101 10.9   84     7   4
## 40     64     175  4.6   83     7   5
## 41     40     314 10.9   83     7   6
## 42     77     276  5.1   88     7   7
## 43     97     267  6.3   92     7   8
## 44     97     272  5.7   92     7   9
## 45     85     175  7.4   89     7  10
## 46     NA     139  8.6   82     7  11
## 47     10     264 14.3   73     7  12
## 48     27     175 14.9   81     7  13
## 49     NA     291 14.9   91     7  14
## 50      7      48 14.3   80     7  15
## 51     48     260  6.9   81     7  16
## 52     35     274 10.3   82     7  17
## 53     61     285  6.3   84     7  18
## 54     79     187  5.1   87     7  19
## 55     63     220 11.5   85     7  20
## 56     16       7  6.9   74     7  21
## 57     NA     258  9.7   81     7  22
## 58     NA     295 11.5   82     7  23
## 59     80     294  8.6   86     7  24
## 60    108     223  8.0   85     7  25
## 61     20      81  8.6   82     7  26
## 62     52      82 12.0   86     7  27
## 63     82     213  7.4   88     7  28
## 64     50     275  7.4   86     7  29
## 65     64     253  7.4   83     7  30
## 66     59     254  9.2   81     7  31
## 67     39      83  6.9   81     8   1
## 68      9      24 13.8   81     8   2
## 69     16      77  7.4   82     8   3
## 70     78      NA  6.9   86     8   4
## 71     35      NA  7.4   85     8   5
## 72     66      NA  4.6   87     8   6
## 73    122     255  4.0   89     8   7
## 74     89     229 10.3   90     8   8
## 75    110     207  8.0   90     8   9
## 76     NA     222  8.6   92     8  10
## 77     NA     137 11.5   86     8  11
## 78     44     192 11.5   86     8  12
## 79     28     273 11.5   82     8  13
## 80     65     157  9.7   80     8  14
## 81     NA      64 11.5   79     8  15
## 82     22      71 10.3   77     8  16
## 83     59      51  6.3   79     8  17
## 84     23     115  7.4   76     8  18
## 85     31     244 10.9   78     8  19
## 86     44     190 10.3   78     8  20
## 87     21     259 15.5   77     8  21
## 88      9      36 14.3   72     8  22
## 89     NA     255 12.6   75     8  23
## 90     45     212  9.7   79     8  24
## 91    168     238  3.4   81     8  25
## 92     73     215  8.0   86     8  26
## 93     NA     153  5.7   88     8  27
## 94     76     203  9.7   97     8  28
## 95    118     225  2.3   94     8  29
## 96     84     237  6.3   96     8  30
## 97     85     188  6.3   94     8  31
## 98     96     167  6.9   91     9   1
## 99     78     197  5.1   92     9   2
## 100    73     183  2.8   93     9   3
## 101    91     189  4.6   93     9   4
## 102    47      95  7.4   87     9   5
## 103    32      92 15.5   84     9   6
## 104    20     252 10.9   80     9   7
## 105    23     220 10.3   78     9   8
## 106    21     230 10.9   75     9   9
## 107    24     259  9.7   73     9  10
## 108    44     236 14.9   81     9  11
## 109    21     259 15.5   76     9  12
## 110    28     238  6.3   77     9  13
## 111     9      24 10.9   71     9  14
## 112    13     112 11.5   71     9  15
## 113    46     237  6.9   78     9  16
## 114    13      27 10.3   76     9  18
## 115    16     201  8.0   82     9  20
## 116    23      14  9.2   71     9  22
## 117    36     139 10.3   81     9  23
## 118    NA     145 13.2   77     9  27
## 119    14     191 14.3   75     9  28
## 120    18     131  8.0   76     9  29
# Select the records with Temp > 80 & Month is after May
filter(airquality, Temp > 80 & Month > 5)
##    Ozone Solar.R Wind Temp Month Day
## 1     NA     186  9.2   84     6   4
## 2     NA     220  8.6   85     6   5
## 3     29     127  9.7   82     6   7
## 4     NA     273  6.9   87     6   8
## 5     71     291 13.8   90     6   9
## 6     39     323 11.5   87     6  10
## 7     NA     259 10.9   93     6  11
## 8     NA     250  9.2   92     6  12
## 9     23     148  8.0   82     6  13
## 10    NA     138  8.0   83     6  30
## 11   135     269  4.1   84     7   1
## 12    49     248  9.2   85     7   2
## 13    32     236  9.2   81     7   3
## 14    NA     101 10.9   84     7   4
## 15    64     175  4.6   83     7   5
## 16    40     314 10.9   83     7   6
## 17    77     276  5.1   88     7   7
## 18    97     267  6.3   92     7   8
## 19    97     272  5.7   92     7   9
## 20    85     175  7.4   89     7  10
## 21    NA     139  8.6   82     7  11
## 22    27     175 14.9   81     7  13
## 23    NA     291 14.9   91     7  14
## 24    48     260  6.9   81     7  16
## 25    35     274 10.3   82     7  17
## 26    61     285  6.3   84     7  18
## 27    79     187  5.1   87     7  19
## 28    63     220 11.5   85     7  20
## 29    NA     258  9.7   81     7  22
## 30    NA     295 11.5   82     7  23
## 31    80     294  8.6   86     7  24
## 32   108     223  8.0   85     7  25
## 33    20      81  8.6   82     7  26
## 34    52      82 12.0   86     7  27
## 35    82     213  7.4   88     7  28
## 36    50     275  7.4   86     7  29
## 37    64     253  7.4   83     7  30
## 38    59     254  9.2   81     7  31
## 39    39      83  6.9   81     8   1
## 40     9      24 13.8   81     8   2
## 41    16      77  7.4   82     8   3
## 42    78      NA  6.9   86     8   4
## 43    35      NA  7.4   85     8   5
## 44    66      NA  4.6   87     8   6
## 45   122     255  4.0   89     8   7
## 46    89     229 10.3   90     8   8
## 47   110     207  8.0   90     8   9
## 48    NA     222  8.6   92     8  10
## 49    NA     137 11.5   86     8  11
## 50    44     192 11.5   86     8  12
## 51    28     273 11.5   82     8  13
## 52   168     238  3.4   81     8  25
## 53    73     215  8.0   86     8  26
## 54    NA     153  5.7   88     8  27
## 55    76     203  9.7   97     8  28
## 56   118     225  2.3   94     8  29
## 57    84     237  6.3   96     8  30
## 58    85     188  6.3   94     8  31
## 59    96     167  6.9   91     9   1
## 60    78     197  5.1   92     9   2
## 61    73     183  2.8   93     9   3
## 62    91     189  4.6   93     9   4
## 63    47      95  7.4   87     9   5
## 64    32      92 15.5   84     9   6
## 65    44     236 14.9   81     9  11
## 66    16     201  8.0   82     9  20
## 67    36     139 10.3   81     9  23
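
For comparison, the same selection can be sketched in base R without dplyr. `hot_idx` and `hot_sub` are names introduced here; unlike the filter() output above, both base versions keep the original row names of airquality.

```r
# Logical indexing: keep rows where both conditions hold
hot_idx <- airquality[airquality$Temp > 80 & airquality$Month > 5, ]

# subset() lets the column names be used directly
hot_sub <- subset(airquality, Temp > 80 & Month > 5)

print(all.equal(hot_idx, hot_sub))
print(head(hot_sub, 3))
```

The two base forms agree here because Temp and Month contain no missing values; with NA in the condition, logical indexing returns NA rows while subset() drops them.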
    • mutate(): adds new variables to the data.
mutate(airquality, TempInC = (Temp - 32) * 5 / 9)
##     Ozone Solar.R Wind Temp Month Day  TempInC
## 1      41     190  7.4   67     5   1 19.44444
## 2      36     118  8.0   72     5   2 22.22222
## 3      12     149 12.6   74     5   3 23.33333
## 4      18     313 11.5   62     5   4 16.66667
## 5      NA      NA 14.3   56     5   5 13.33333
## 6      28      NA 14.9   66     5   6 18.88889
## 7      23     299  8.6   65     5   7 18.33333
## 8      19      99 13.8   59     5   8 15.00000
## 9       8      19 20.1   61     5   9 16.11111
## 10     NA     194  8.6   69     5  10 20.55556
## 11      7      NA  6.9   74     5  11 23.33333
## 12     16     256  9.7   69     5  12 20.55556
## 13     11     290  9.2   66     5  13 18.88889
## 14     14     274 10.9   68     5  14 20.00000
## 15     18      65 13.2   58     5  15 14.44444
## 16     14     334 11.5   64     5  16 17.77778
## 17     34     307 12.0   66     5  17 18.88889
## 18      6      78 18.4   57     5  18 13.88889
## 19     30     322 11.5   68     5  19 20.00000
## 20     11      44  9.7   62     5  20 16.66667
## 21      1       8  9.7   59     5  21 15.00000
## 22     11     320 16.6   73     5  22 22.77778
## 23      4      25  9.7   61     5  23 16.11111
## 24     32      92 12.0   61     5  24 16.11111
## 25     NA      66 16.6   57     5  25 13.88889
## 26     NA     266 14.9   58     5  26 14.44444
## 27     NA      NA  8.0   57     5  27 13.88889
## 28     23      13 12.0   67     5  28 19.44444
## 29     45     252 14.9   81     5  29 27.22222
## 30    115     223  5.7   79     5  30 26.11111
## 31     37     279  7.4   76     5  31 24.44444
## 32     NA     286  8.6   78     6   1 25.55556
## 33     NA     287  9.7   74     6   2 23.33333
## 34     NA     242 16.1   67     6   3 19.44444
## 35     NA     186  9.2   84     6   4 28.88889
## 36     NA     220  8.6   85     6   5 29.44444
## 37     NA     264 14.3   79     6   6 26.11111
## 38     29     127  9.7   82     6   7 27.77778
## 39     NA     273  6.9   87     6   8 30.55556
## 40     71     291 13.8   90     6   9 32.22222
## 41     39     323 11.5   87     6  10 30.55556
## 42     NA     259 10.9   93     6  11 33.88889
## 43     NA     250  9.2   92     6  12 33.33333
## 44     23     148  8.0   82     6  13 27.77778
## 45     NA     332 13.8   80     6  14 26.66667
## 46     NA     322 11.5   79     6  15 26.11111
## 47     21     191 14.9   77     6  16 25.00000
## 48     37     284 20.7   72     6  17 22.22222
## 49     20      37  9.2   65     6  18 18.33333
## 50     12     120 11.5   73     6  19 22.77778
## 51     13     137 10.3   76     6  20 24.44444
## 52     NA     150  6.3   77     6  21 25.00000
## 53     NA      59  1.7   76     6  22 24.44444
## 54     NA      91  4.6   76     6  23 24.44444
## 55     NA     250  6.3   76     6  24 24.44444
## 56     NA     135  8.0   75     6  25 23.88889
## 57     NA     127  8.0   78     6  26 25.55556
## 58     NA      47 10.3   73     6  27 22.77778
## 59     NA      98 11.5   80     6  28 26.66667
## 60     NA      31 14.9   77     6  29 25.00000
## 61     NA     138  8.0   83     6  30 28.33333
## 62    135     269  4.1   84     7   1 28.88889
## 63     49     248  9.2   85     7   2 29.44444
## 64     32     236  9.2   81     7   3 27.22222
## 65     NA     101 10.9   84     7   4 28.88889
## 66     64     175  4.6   83     7   5 28.33333
## 67     40     314 10.9   83     7   6 28.33333
## 68     77     276  5.1   88     7   7 31.11111
## 69     97     267  6.3   92     7   8 33.33333
## 70     97     272  5.7   92     7   9 33.33333
## 71     85     175  7.4   89     7  10 31.66667
## 72     NA     139  8.6   82     7  11 27.77778
## 73     10     264 14.3   73     7  12 22.77778
## 74     27     175 14.9   81     7  13 27.22222
## 75     NA     291 14.9   91     7  14 32.77778
## 76      7      48 14.3   80     7  15 26.66667
## 77     48     260  6.9   81     7  16 27.22222
## 78     35     274 10.3   82     7  17 27.77778
## 79     61     285  6.3   84     7  18 28.88889
## 80     79     187  5.1   87     7  19 30.55556
## 81     63     220 11.5   85     7  20 29.44444
## 82     16       7  6.9   74     7  21 23.33333
## 83     NA     258  9.7   81     7  22 27.22222
## 84     NA     295 11.5   82     7  23 27.77778
## 85     80     294  8.6   86     7  24 30.00000
## 86    108     223  8.0   85     7  25 29.44444
## 87     20      81  8.6   82     7  26 27.77778
## 88     52      82 12.0   86     7  27 30.00000
## 89     82     213  7.4   88     7  28 31.11111
## 90     50     275  7.4   86     7  29 30.00000
## 91     64     253  7.4   83     7  30 28.33333
## 92     59     254  9.2   81     7  31 27.22222
## 93     39      83  6.9   81     8   1 27.22222
## 94      9      24 13.8   81     8   2 27.22222
## 95     16      77  7.4   82     8   3 27.77778
## 96     78      NA  6.9   86     8   4 30.00000
## 97     35      NA  7.4   85     8   5 29.44444
## 98     66      NA  4.6   87     8   6 30.55556
## 99    122     255  4.0   89     8   7 31.66667
## 100    89     229 10.3   90     8   8 32.22222
## 101   110     207  8.0   90     8   9 32.22222
## 102    NA     222  8.6   92     8  10 33.33333
## 103    NA     137 11.5   86     8  11 30.00000
## 104    44     192 11.5   86     8  12 30.00000
## 105    28     273 11.5   82     8  13 27.77778
## 106    65     157  9.7   80     8  14 26.66667
## 107    NA      64 11.5   79     8  15 26.11111
## 108    22      71 10.3   77     8  16 25.00000
## 109    59      51  6.3   79     8  17 26.11111
## 110    23     115  7.4   76     8  18 24.44444
## 111    31     244 10.9   78     8  19 25.55556
## 112    44     190 10.3   78     8  20 25.55556
## 113    21     259 15.5   77     8  21 25.00000
## 114     9      36 14.3   72     8  22 22.22222
## 115    NA     255 12.6   75     8  23 23.88889
## 116    45     212  9.7   79     8  24 26.11111
## 117   168     238  3.4   81     8  25 27.22222
## 118    73     215  8.0   86     8  26 30.00000
## 119    NA     153  5.7   88     8  27 31.11111
## 120    76     203  9.7   97     8  28 36.11111
## 121   118     225  2.3   94     8  29 34.44444
## 122    84     237  6.3   96     8  30 35.55556
## 123    85     188  6.3   94     8  31 34.44444
## 124    96     167  6.9   91     9   1 32.77778
## 125    78     197  5.1   92     9   2 33.33333
## 126    73     183  2.8   93     9   3 33.88889
## 127    91     189  4.6   93     9   4 33.88889
## 128    47      95  7.4   87     9   5 30.55556
## 129    32      92 15.5   84     9   6 28.88889
## 130    20     252 10.9   80     9   7 26.66667
## 131    23     220 10.3   78     9   8 25.55556
## 132    21     230 10.9   75     9   9 23.88889
## 133    24     259  9.7   73     9  10 22.77778
## 134    44     236 14.9   81     9  11 27.22222
## 135    21     259 15.5   76     9  12 24.44444
## 136    28     238  6.3   77     9  13 25.00000
## 137     9      24 10.9   71     9  14 21.66667
## 138    13     112 11.5   71     9  15 21.66667
## 139    46     237  6.9   78     9  16 25.55556
## 140    18     224 13.8   67     9  17 19.44444
## 141    13      27 10.3   76     9  18 24.44444
## 142    24     238 10.3   68     9  19 20.00000
## 143    16     201  8.0   82     9  20 27.77778
## 144    13     238 12.6   64     9  21 17.77778
## 145    23      14  9.2   71     9  22 21.66667
## 146    36     139 10.3   81     9  23 27.22222
## 147     7      49 10.3   69     9  24 20.55556
## 148    14      20 16.6   63     9  25 17.22222
## 149    30     193  6.9   70     9  26 21.11111
## 150    NA     145 13.2   77     9  27 25.00000
## 151    14     191 14.3   75     9  28 23.88889
## 152    18     131  8.0   76     9  29 24.44444
## 153    20     223 11.5   68     9  30 20.00000
  • summarise(): the function is used to summarise multiple values into a single value.
summarise(airquality, mean(Temp, na.rm = TRUE))
##   mean(Temp, na.rm = TRUE)
## 1                 77.88235
  • group_by(): the function is used to group data by one or more variables.
summarise(group_by(airquality, Month), mean(Temp, na.rm = TRUE))
## # A tibble: 5 × 2
##   Month `mean(Temp, na.rm = TRUE)`
##   <int>                      <dbl>
## 1     5                       65.5
## 2     6                       79.1
## 3     7                       83.9
## 4     8                       84.0
## 5     9                       76.9
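The summary column above gets an awkward auto-generated name. As a small sketch, naming each summary inside summarise() gives a tidier result; the column names mean_temp and n_days are chosen here only for illustration.

```r
library(dplyr)

# Name each summary so the result has readable column names
monthly <- summarise(
  group_by(airquality, Month),
  mean_temp = mean(Temp, na.rm = TRUE),  # average temperature per month
  n_days    = n()                        # number of rows (days) per month
)
monthly
```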
  • sample_n() and sample_frac(): these two functions select random rows from a table, either a fixed number of rows (size = 10) or a fraction of the rows (size = 0.1).
sample_n(airquality, size = 10)
##    Ozone Solar.R Wind Temp Month Day
## 1     NA     295 11.5   82     7  23
## 2     97     272  5.7   92     7   9
## 3     27     175 14.9   81     7  13
## 4     NA     259 10.9   93     6  11
## 5     31     244 10.9   78     8  19
## 6     14      20 16.6   63     9  25
## 7     11      44  9.7   62     5  20
## 8     23     148  8.0   82     6  13
## 9    118     225  2.3   94     8  29
## 10    20      81  8.6   82     7  26
sample_frac(airquality, size = 0.1)
##    Ozone Solar.R Wind Temp Month Day
## 1     97     272  5.7   92     7   9
## 2    118     225  2.3   94     8  29
## 3     71     291 13.8   90     6   9
## 4     NA      66 16.6   57     5  25
## 5     NA     153  5.7   88     8  27
## 6     84     237  6.3   96     8  30
## 7     NA     273  6.9   87     6   8
## 8     NA     259 10.9   93     6  11
## 9     44     236 14.9   81     9  11
## 10    32      92 12.0   61     5  24
## 11    14     274 10.9   68     5  14
## 12    20     252 10.9   80     9   7
## 13    NA     332 13.8   80     6  14
## 14    11     320 16.6   73     5  22
## 15    NA     255 12.6   75     8  23
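In dplyr 1.0.0 and later, sample_n() and sample_frac() are marked as superseded by slice_sample(). A rough equivalent of the two calls above might look like this, assuming a recent dplyr is installed; set.seed() is added only to make the draw reproducible, and the object names are illustrative.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only for reproducibility

# n = fixed number of rows; prop = fraction of the rows
ten_rows   <- slice_sample(airquality, n = 10)
tenth_rows <- slice_sample(airquality, prop = 0.1)

nrow(ten_rows)
nrow(tenth_rows)
```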
  • count(): the function tallies observations based on a group.
count(airquality, Month)
##   Month  n
## 1     5 31
## 2     6 30
## 3     7 31
## 4     8 31
## 5     9 30
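count() can be read as a shortcut for grouping and then counting rows. The sketch below compares the two approaches on the same data; by_count and by_summary are illustrative names.

```r
library(dplyr)

by_count   <- count(airquality, Month)
by_summary <- summarise(group_by(airquality, Month), n = n())

# Both approaches give the same tallies per month
all(by_count$n == by_summary$n)
```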
  • arrange(): the function is used to arrange rows by variables.
arrange(airquality, desc(Month), Day)
##     Ozone Solar.R Wind Temp Month Day
## 1      96     167  6.9   91     9   1
## 2      78     197  5.1   92     9   2
## 3      73     183  2.8   93     9   3
## 4      91     189  4.6   93     9   4
## 5      47      95  7.4   87     9   5
## 6      32      92 15.5   84     9   6
## 7      20     252 10.9   80     9   7
## 8      23     220 10.3   78     9   8
## 9      21     230 10.9   75     9   9
## 10     24     259  9.7   73     9  10
## 11     44     236 14.9   81     9  11
## 12     21     259 15.5   76     9  12
## 13     28     238  6.3   77     9  13
## 14      9      24 10.9   71     9  14
## 15     13     112 11.5   71     9  15
## 16     46     237  6.9   78     9  16
## 17     18     224 13.8   67     9  17
## 18     13      27 10.3   76     9  18
## 19     24     238 10.3   68     9  19
## 20     16     201  8.0   82     9  20
## 21     13     238 12.6   64     9  21
## 22     23      14  9.2   71     9  22
## 23     36     139 10.3   81     9  23
## 24      7      49 10.3   69     9  24
## 25     14      20 16.6   63     9  25
## 26     30     193  6.9   70     9  26
## 27     NA     145 13.2   77     9  27
## 28     14     191 14.3   75     9  28
## 29     18     131  8.0   76     9  29
## 30     20     223 11.5   68     9  30
## 31     39      83  6.9   81     8   1
## 32      9      24 13.8   81     8   2
## 33     16      77  7.4   82     8   3
## 34     78      NA  6.9   86     8   4
## 35     35      NA  7.4   85     8   5
## 36     66      NA  4.6   87     8   6
## 37    122     255  4.0   89     8   7
## 38     89     229 10.3   90     8   8
## 39    110     207  8.0   90     8   9
## 40     NA     222  8.6   92     8  10
## 41     NA     137 11.5   86     8  11
## 42     44     192 11.5   86     8  12
## 43     28     273 11.5   82     8  13
## 44     65     157  9.7   80     8  14
## 45     NA      64 11.5   79     8  15
## 46     22      71 10.3   77     8  16
## 47     59      51  6.3   79     8  17
## 48     23     115  7.4   76     8  18
## 49     31     244 10.9   78     8  19
## 50     44     190 10.3   78     8  20
## 51     21     259 15.5   77     8  21
## 52      9      36 14.3   72     8  22
## 53     NA     255 12.6   75     8  23
## 54     45     212  9.7   79     8  24
## 55    168     238  3.4   81     8  25
## 56     73     215  8.0   86     8  26
## 57     NA     153  5.7   88     8  27
## 58     76     203  9.7   97     8  28
## 59    118     225  2.3   94     8  29
## 60     84     237  6.3   96     8  30
## 61     85     188  6.3   94     8  31
## 62    135     269  4.1   84     7   1
## 63     49     248  9.2   85     7   2
## 64     32     236  9.2   81     7   3
## 65     NA     101 10.9   84     7   4
## 66     64     175  4.6   83     7   5
## 67     40     314 10.9   83     7   6
## 68     77     276  5.1   88     7   7
## 69     97     267  6.3   92     7   8
## 70     97     272  5.7   92     7   9
## 71     85     175  7.4   89     7  10
## 72     NA     139  8.6   82     7  11
## 73     10     264 14.3   73     7  12
## 74     27     175 14.9   81     7  13
## 75     NA     291 14.9   91     7  14
## 76      7      48 14.3   80     7  15
## 77     48     260  6.9   81     7  16
## 78     35     274 10.3   82     7  17
## 79     61     285  6.3   84     7  18
## 80     79     187  5.1   87     7  19
## 81     63     220 11.5   85     7  20
## 82     16       7  6.9   74     7  21
## 83     NA     258  9.7   81     7  22
## 84     NA     295 11.5   82     7  23
## 85     80     294  8.6   86     7  24
## 86    108     223  8.0   85     7  25
## 87     20      81  8.6   82     7  26
## 88     52      82 12.0   86     7  27
## 89     82     213  7.4   88     7  28
## 90     50     275  7.4   86     7  29
## 91     64     253  7.4   83     7  30
## 92     59     254  9.2   81     7  31
## 93     NA     286  8.6   78     6   1
## 94     NA     287  9.7   74     6   2
## 95     NA     242 16.1   67     6   3
## 96     NA     186  9.2   84     6   4
## 97     NA     220  8.6   85     6   5
## 98     NA     264 14.3   79     6   6
## 99     29     127  9.7   82     6   7
## 100    NA     273  6.9   87     6   8
## 101    71     291 13.8   90     6   9
## 102    39     323 11.5   87     6  10
## 103    NA     259 10.9   93     6  11
## 104    NA     250  9.2   92     6  12
## 105    23     148  8.0   82     6  13
## 106    NA     332 13.8   80     6  14
## 107    NA     322 11.5   79     6  15
## 108    21     191 14.9   77     6  16
## 109    37     284 20.7   72     6  17
## 110    20      37  9.2   65     6  18
## 111    12     120 11.5   73     6  19
## 112    13     137 10.3   76     6  20
## 113    NA     150  6.3   77     6  21
## 114    NA      59  1.7   76     6  22
## 115    NA      91  4.6   76     6  23
## 116    NA     250  6.3   76     6  24
## 117    NA     135  8.0   75     6  25
## 118    NA     127  8.0   78     6  26
## 119    NA      47 10.3   73     6  27
## 120    NA      98 11.5   80     6  28
## 121    NA      31 14.9   77     6  29
## 122    NA     138  8.0   83     6  30
## 123    41     190  7.4   67     5   1
## 124    36     118  8.0   72     5   2
## 125    12     149 12.6   74     5   3
## 126    18     313 11.5   62     5   4
## 127    NA      NA 14.3   56     5   5
## 128    28      NA 14.9   66     5   6
## 129    23     299  8.6   65     5   7
## 130    19      99 13.8   59     5   8
## 131     8      19 20.1   61     5   9
## 132    NA     194  8.6   69     5  10
## 133     7      NA  6.9   74     5  11
## 134    16     256  9.7   69     5  12
## 135    11     290  9.2   66     5  13
## 136    14     274 10.9   68     5  14
## 137    18      65 13.2   58     5  15
## 138    14     334 11.5   64     5  16
## 139    34     307 12.0   66     5  17
## 140     6      78 18.4   57     5  18
## 141    30     322 11.5   68     5  19
## 142    11      44  9.7   62     5  20
## 143     1       8  9.7   59     5  21
## 144    11     320 16.6   73     5  22
## 145     4      25  9.7   61     5  23
## 146    32      92 12.0   61     5  24
## 147    NA      66 16.6   57     5  25
## 148    NA     266 14.9   58     5  26
## 149    NA      NA  8.0   57     5  27
## 150    23      13 12.0   67     5  28
## 151    45     252 14.9   81     5  29
## 152   115     223  5.7   79     5  30
## 153    37     279  7.4   76     5  31

Now, let’s put those commands together!

airquality %>% 
    filter(Temp > 70 & Month != 5) %>% 
    group_by(Month) %>% 
    summarise(mean(Temp, na.rm = TRUE))
## # A tibble: 4 × 2
##   Month `mean(Temp, na.rm = TRUE)`
##   <int>                      <dbl>
## 1     6                       80.0
## 2     7                       83.9
## 3     8                       84.0
## 4     9                       79.9
  • tidyr

    tidyr is a package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages).

    • gather(data, key, value, …, na.rm = FALSE, convert = FALSE)
library(tidyr)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars$car <- rownames(mtcars)
mtcars <- mtcars[, c(12, 1:11)]
head(mtcars)
##                                 car  mpg cyl disp  hp drat    wt  qsec vs am
## Mazda RX4                 Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1
## Mazda RX4 Wag         Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1
## Datsun 710               Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1
## Hornet 4 Drive       Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0
## Hornet Sportabout Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0
## Valiant                     Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0
##                   gear carb
## Mazda RX4            4    4
## Mazda RX4 Wag        4    4
## Datsun 710           4    1
## Hornet 4 Drive       3    1
## Hornet Sportabout    3    2
## Valiant              3    1
mtcarNew <- mtcars %>% gather(attribute, value, -car)
head(mtcarNew)
##                 car attribute value
## 1         Mazda RX4       mpg  21.0
## 2     Mazda RX4 Wag       mpg  21.0
## 3        Datsun 710       mpg  22.8
## 4    Hornet 4 Drive       mpg  21.4
## 5 Hornet Sportabout       mpg  18.7
## 6           Valiant       mpg  18.1
tail(mtcarNew)
##                car attribute value
## 347  Porsche 914-2      carb     2
## 348   Lotus Europa      carb     2
## 349 Ford Pantera L      carb     4
## 350   Ferrari Dino      carb     6
## 351  Maserati Bora      carb     8
## 352     Volvo 142E      carb     2
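In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). Here is a sketch of the same reshaping, rebuilt from the built-in mtcars so that it runs on its own (cars_df and cars_long are illustrative names). Note that pivot_longer() orders the result car by car rather than attribute by attribute.

```r
library(tidyr)

# Start from a fresh copy with the car name as a regular column
cars_df <- data.frame(car = rownames(datasets::mtcars), datasets::mtcars,
                      row.names = NULL)

# Every column except `car` becomes an attribute/value pair
cars_long <- pivot_longer(cars_df, cols = -car,
                          names_to = "attribute", values_to = "value")
head(cars_long)
```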
    • spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)
mtcarSpread <- mtcarNew %>% spread(attribute, value)
head(mtcarSpread)
##                  car am carb cyl disp drat gear  hp  mpg  qsec vs    wt
## 1        AMC Javelin  0    2   8  304 3.15    3 150 15.2 17.30  0 3.435
## 2 Cadillac Fleetwood  0    4   8  472 2.93    3 205 10.4 17.98  0 5.250
## 3         Camaro Z28  0    4   8  350 3.73    3 245 13.3 15.41  0 3.840
## 4  Chrysler Imperial  0    4   8  440 3.23    3 230 14.7 17.42  0 5.345
## 5         Datsun 710  1    1   4  108 3.85    4  93 22.8 18.61  1 2.320
## 6   Dodge Challenger  0    2   8  318 2.76    3 150 15.5 16.87  0 3.520
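Likewise, spread() is superseded by pivot_wider() in tidyr 1.0.0 and later. A self-contained sketch of the round trip, with illustrative object names:

```r
library(tidyr)

cars_df <- data.frame(car = rownames(datasets::mtcars), datasets::mtcars,
                      row.names = NULL)
cars_long <- pivot_longer(cars_df, cols = -car,
                          names_to = "attribute", values_to = "value")

# Spread the attribute/value pairs back into one column per attribute
cars_wide <- pivot_wider(cars_long, names_from = attribute, values_from = value)
dim(cars_wide)
```

Unlike spread(), which sorts the rows and new columns, pivot_wider() keeps them in order of first appearance.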
    • unite(data, col, ..., sep = "_", remove = TRUE)
set.seed(1)
date <- as.Date('2016-01-01') + 0:14
hour <- sample(1:24, 15)
min <- sample(1:60, 15)
second <- sample(1:60, 15)
event <- sample(letters, 15)
data <- data.frame(date, hour, min, second, event)
data
##          date hour min second event
## 1  2016-01-01    4  15     35     w
## 2  2016-01-02    7  21      6     x
## 3  2016-01-03    1  37     10     f
## 4  2016-01-04    2  41     42     g
## 5  2016-01-05   11  25     38     s
## 6  2016-01-06   14  46     47     j
## 7  2016-01-07   18  58     20     y
## 8  2016-01-08   22  54     28     n
## 9  2016-01-09    5  34     54     b
## 10 2016-01-10   16  42     44     m
## 11 2016-01-11   10  56     23     r
## 12 2016-01-12    6  44     59     t
## 13 2016-01-13   19  60     40     v
## 14 2016-01-14   23  33     51     o
## 15 2016-01-15    9  20     25     a
dataNew <- data %>%
  unite(datehour, date, hour, sep = ' ') %>%
  unite(datetime, datehour, min, second, sep = ':')
dataNew
##               datetime event
## 1   2016-01-01 4:15:35     w
## 2    2016-01-02 7:21:6     x
## 3   2016-01-03 1:37:10     f
## 4   2016-01-04 2:41:42     g
## 5  2016-01-05 11:25:38     s
## 6  2016-01-06 14:46:47     j
## 7  2016-01-07 18:58:20     y
## 8  2016-01-08 22:54:28     n
## 9   2016-01-09 5:34:54     b
## 10 2016-01-10 16:42:44     m
## 11 2016-01-11 10:56:23     r
## 12  2016-01-12 6:44:59     t
## 13 2016-01-13 19:60:40     v
## 14 2016-01-14 23:33:51     o
## 15  2016-01-15 9:20:25     a
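Note that unite() pastes values as they are, so single-digit fields give strings such as 7:21:6. If a fixed-width timestamp such as 07:21:06 is wanted, one option is to zero-pad the pieces first. A small sketch with made-up rows (the data frame events is hypothetical):

```r
library(tidyr)

events <- data.frame(
  date   = as.Date("2016-01-01") + 0:1,
  hour   = c(4, 7),
  min    = c(15, 21),
  second = c(35, 6)
)

# Zero-pad each time field to two digits before pasting
events$hour   <- sprintf("%02d", as.integer(events$hour))
events$min    <- sprintf("%02d", as.integer(events$min))
events$second <- sprintf("%02d", as.integer(events$second))

stamped <- unite(events, time, hour, min, second, sep = ":")
stamped <- unite(stamped, datetime, date, time, sep = " ")

# Convert the text to a date-time class for later computation
stamped$parsed <- as.POSIXct(stamped$datetime,
                             format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
stamped
```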
    • separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
data1 <- dataNew %>% 
  separate(datetime, c('date', 'time'), sep = ' ') %>% 
  separate(time, c('hour', 'min', 'second'), sep = ':')
data1
##          date hour min second event
## 1  2016-01-01    4  15     35     w
## 2  2016-01-02    7  21      6     x
## 3  2016-01-03    1  37     10     f
## 4  2016-01-04    2  41     42     g
## 5  2016-01-05   11  25     38     s
## 6  2016-01-06   14  46     47     j
## 7  2016-01-07   18  58     20     y
## 8  2016-01-08   22  54     28     n
## 9  2016-01-09    5  34     54     b
## 10 2016-01-10   16  42     44     m
## 11 2016-01-11   10  56     23     r
## 12 2016-01-12    6  44     59     t
## 13 2016-01-13   19  60     40     v
## 14 2016-01-14   23  33     51     o
## 15 2016-01-15    9  20     25     a
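separate() returns character columns by default, even when the pieces look numeric. Below is a short sketch of the convert = TRUE argument (listed in the signature above), which asks tidyr to guess the column types; the data frame times is made up for illustration.

```r
library(tidyr)

times <- data.frame(time = c("4:15:35", "7:21:6"))

as_text    <- separate(times, time, c("hour", "min", "second"), sep = ":")
as_numbers <- separate(times, time, c("hour", "min", "second"), sep = ":",
                       convert = TRUE)

class(as_text$hour)     # character
class(as_numbers$hour)  # integer
```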
  • purrr

    purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions which allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the map() functions is the iteration chapter in R for data science.

library(purrr)
## 
## Attaching package: 'purrr'
## The following object is masked from 'package:plyr':
## 
##     compact
## The following object is masked from 'package:data.table':
## 
##     transpose
mtcars %>%
  split(.$cyl) %>% # from base R
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")
##         4         6         8 
## 0.5086326 0.4645102 0.4229655
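To see what the map() chain replaces, here is a sketch of the same computation written as an explicit for loop and compared with the purrr version. It uses the built-in datasets::mtcars; the names by_cyl, r2_loop and r2_map are illustrative.

```r
library(purrr)

by_cyl <- split(datasets::mtcars, datasets::mtcars$cyl)

# Loop version: fit one model per cylinder group and store its R-squared
r2_loop <- numeric(length(by_cyl))
names(r2_loop) <- names(by_cyl)
for (k in names(by_cyl)) {
  fit <- lm(mpg ~ wt, data = by_cyl[[k]])
  r2_loop[[k]] <- summary(fit)$r.squared
}

# purrr version: each map() step applies one function to every element
fits      <- map(by_cyl, function(d) lm(mpg ~ wt, data = d))
summaries <- map(fits, summary)
r2_map    <- map_dbl(summaries, "r.squared")  # extract by name, return a double vector

all.equal(r2_loop, r2_map)
```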
  • stringr

    stringr is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations. stringr focusses on the most important and commonly used string manipulation functions whereas stringi provides a comprehensive set covering almost anything you can imagine.

library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")
str_length(x) 
## [1] 3 5 5 5 4 9
str_c(x, collapse = ", ")
## [1] "why, video, cross, extra, deal, authority"
str_sub(x, 1, 2)
## [1] "wh" "vi" "cr" "ex" "de" "au"
str_dup(x, 2:7)
## [1] "whywhy"                                                         
## [2] "videovideovideo"                                                
## [3] "crosscrosscrosscross"                                           
## [4] "extraextraextraextraextra"                                      
## [5] "dealdealdealdealdealdeal"                                       
## [6] "authorityauthorityauthorityauthorityauthorityauthorityauthority"
str_subset(x, "[aeiou]")
## [1] "video"     "cross"     "extra"     "deal"      "authority"
str_count(x, "[aeiou]")
## [1] 0 3 1 2 2 4
str_detect(x, "[aeiou]")
## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
str_locate(x, "[aeiou]")
##      start end
## [1,]    NA  NA
## [2,]     2   2
## [3,]     3   3
## [4,]     1   1
## [5,]     2   2
## [6,]     1   1
str_extract(x, "[aeiou]")
## [1] NA  "i" "o" "e" "e" "a"
str_match(x, "(.)[aeiou](.)")
##      [,1]  [,2] [,3]
## [1,] NA    NA   NA  
## [2,] "vid" "v"  "d" 
## [3,] "ros" "r"  "s" 
## [4,] NA    NA   NA  
## [5,] "dea" "d"  "a" 
## [6,] "aut" "a"  "t"
str_replace(x, "[aeiou]", "?")
## [1] "why"       "v?deo"     "cr?ss"     "?xtra"     "d?al"      "?uthority"
str_split(c("a,b", "c,d,e"), ",")
## [[1]]
## [1] "a" "b"
## 
## [[2]]
## [1] "c" "d" "e"
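Most of the pattern functions above act on the first match only. As a brief sketch, the _all variants (here str_replace_all()) act on every match:

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

first_only <- str_replace(x, "[aeiou]", "?")      # first vowel only
every_one  <- str_replace_all(x, "[aeiou]", "?")  # every vowel
every_one
```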

What’s new in R 4.1.0?

1. New pipe operator: |>

rnorm(100, mean = 4, sd = 1) |>
  density() |>
  plot()

c("Homo sapiens", "Mus musculus", "Rattus norvegicus") |> {function(i) grepl("homo", i, ignore.case = TRUE)}()
## [1]  TRUE FALSE FALSE

2. Shorthand syntax for anonymous functions: \(x)

  • How did we pass an anonymous function to map() before R 4.1.0?
map(
  letters[2:3],
  function(x) {
    pattern <- paste0("^", x)
    grep(pattern, ls("package:datasets"), value = TRUE, ignore.case = TRUE)
  }
)
  • Since R 4.1.0, the same function can be written with the \(x) shorthand:
map(
  letters[2:3],
  \(x){
    pattern <- paste0("^", x)
    grep(pattern, ls("package:datasets"), value = TRUE, ignore.case = TRUE)
  }
)
## [[1]]
## [1] "beaver1"      "beaver2"      "BJsales"      "BJsales.lead" "BOD"         
## 
## [[2]]
## [1] "cars"        "ChickWeight" "chickwts"    "co2"         "CO2"        
## [6] "crimtab"
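The shorthand is only a different spelling of function(x). A minimal base R sketch, assuming R >= 4.1.0, shows that both forms behave identically:

```r
# Same anonymous function, written the long way and with the \(x) shorthand
squares_long  <- sapply(1:3, function(x) x^2)
squares_short <- sapply(1:3, \(x) x^2)

identical(squares_long, squares_short)
```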

Updates in R 4.2.0

  • Use the pipe operator more elegantly with the underscore placeholder _
    • In R 4.1
mtcars |> (\(x) lm(hp ~ cyl, data = x))()
    • In R 4.2
mtcars |> lm(hp ~ cyl, data = _)
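The placeholder has to be passed to a named argument, and in R 4.2 it may appear only once in the call. A small sketch, assuming R >= 4.2.0 (fit and n_rows are illustrative names):

```r
# The left-hand side is substituted where `_` appears
fit <- mtcars |> lm(hp ~ cyl, data = _)
coef(fit)

# Without a placeholder, the left-hand side becomes the first argument
n_rows <- mtcars |> nrow()
n_rows
```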

Drawing graph

ggplot2 - The Art of Grammar of Graphics

ggplot2 is an R plotting system based on the Grammar of Graphics, which takes the good parts of base R and lattice graphics systems while eliminating the bad parts. It not only handles tedious plotting details (such as legend creation) but also provides a powerful graphics model that makes creating complex multi-layered graphics simple.

In-depth Analysis of Grammar of Graphics Core Philosophy

Why Do We Need a Grammar of Graphics?

In traditional graphics systems, we are usually limited to predefined chart types (scatter plots, bar charts, line plots, etc.). This is like being able to say only fixed sentences without being able to combine vocabulary to express new ideas.

# Traditional thinking: Choose fixed chart types
plot(x = mtcars$wt, y = mtcars$mpg)  # Scatter plot
barplot(table(mtcars$cyl))           # Bar chart

# Grammar of Graphics thinking: Combine grammar elements
ggplot(mtcars, aes(x = wt, y = mpg)) +     # Data + aesthetic mapping
  geom_point() +                           # Geometric object
  stat_smooth(method = "lm") +             # Statistical transformation
  facet_wrap(~ cyl) +                      # Faceting
  theme_minimal()                          # Theme

Seven Core Components of Grammar of Graphics

“By understanding the grammar and how its components fit together, you can create a wider range of visualizations” - Hadley Wickham

  1. Data: The dataset to be visualized
  2. Aesthetic Mappings: How variables map to visual properties
  3. Layers: Composed of geometric objects and statistical transformations
  4. Scales: Control mapping from data values to aesthetic attributes
  5. Coordinate System: Define how data coordinates are displayed
  6. Facets: Split data into multiple subplots
  7. Themes: Control the overall visual appearance of the plot

Component 1: Data - The Foundation of Everything

library(ggplot2)
library(dplyr)

# Explore our data
head(mtcars)
##                                 car  mpg cyl disp  hp drat    wt  qsec vs am
## Mazda RX4                 Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1
## Mazda RX4 Wag         Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1
## Datsun 710               Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1
## Hornet 4 Drive       Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0
## Hornet Sportabout Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0
## Valiant                     Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0
##                   gear carb
## Mazda RX4            4    4
## Mazda RX4 Wag        4    4
## Datsun 710           4    1
## Hornet 4 Drive       3    1
## Hornet Sportabout    3    2
## Valiant              3    1
str(mtcars)
## 'data.frame':    32 obs. of  12 variables:
##  $ car : chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# ggplot2's first principle: Tidy Data
# Each row is an observation, each column is a variable

Component 2: Aesthetic Mappings - Bridge Between Data and Visuals

Aesthetic mappings define how data variables correspond to visual properties. This is the core concept of Grammar of Graphics.

# Basic aesthetic mapping
p_base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Global aesthetic mapping vs local aesthetic mapping
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +  # Global
  geom_point() +
  geom_smooth(aes(color = NULL), method = "lm", color = "black")  # Local override
## `geom_smooth()` using formula = 'y ~ x'

# Types of aesthetic mappings
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(
    color = hp,        # Continuous variable mapped to color
    size = qsec,       # Continuous variable mapped to size
    shape = factor(am), # Categorical variable mapped to shape
    alpha = vs         # Numeric 0/1 variable mapped to transparency
  ))

Aesthetic Mapping vs Setting Fixed Aesthetic Properties

# Mapping (inside aes()) vs Setting (outside aes())
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl))) +  # Mapping: color corresponds to data
  labs(title = "Mapping: Color by Cylinder Count")

p2 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "red") +  # Setting: all points are red
  labs(title = "Setting: All Points Are Red")

# Display the difference
library(patchwork)
p1 / p2
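A common slip is to put a constant inside aes(). The sketch below uses layer_data() to inspect the colours ggplot2 actually computes in each case; p_set and p_map are illustrative names.

```r
library(ggplot2)

# Setting: the literal colour "red" is applied to every point
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "red")

# Mapping a constant: "red" is treated as a data value with a single level,
# so the colour comes from the default discrete scale, not from the string
p_map <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = "red"))

# Colours computed for the first layer of each plot
unique(layer_data(p_set, 1)$colour)
unique(layer_data(p_map, 1)$colour)
```

The second plot also gains a legend with a single entry labelled "red", which is usually a sign that a setting was written as a mapping by mistake.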

Component 3: Layers - Building Blocks for Complex Graphics

In Grammar of Graphics, each layer contains five components:

  1. Data: Layer-specific data
  2. Aesthetic mappings: Mapping from variables to visual properties
  3. Geom: Type of geometric object
  4. Stat: Statistical transformation
  5. Position: Position adjustment
# Understanding layer structure
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  # Layer 1: Raw data points
  geom_point(aes(color = "Raw Data"), position = position_jitter(width = 0.2)) +
  # Layer 2: Statistical summary (mean)
  stat_summary(fun = mean, geom = "point", 
               aes(color = "Mean"), size = 4, shape = 17) +
  # Layer 3: Error bars
  stat_summary(fun.data = mean_se, geom = "errorbar", 
               aes(color = "Std Error"), width = 0.2) +
  # Layer 4: Trend line
  geom_smooth(aes(group = 1, color = "Trend Line"), method = "lm", se = FALSE) +
  scale_color_manual(values = c("Raw Data" = "gray60", 
                               "Mean" = "red", 
                               "Std Error" = "blue",
                               "Trend Line" = "darkgreen")) +
  labs(title = "Multi-layer Display: Engine Cylinders vs Fuel Efficiency",
       x = "Cylinders", y = "Miles Per Gallon", color = "Layer Type")
## `geom_smooth()` using formula = 'y ~ x'

Relationship Between Geom and Stat

# Every geom has a corresponding default stat
# geom_bar()'s default stat is "count"
p1 <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(title = "geom_bar() + stat_count()")

# stat_count()'s default geom is "bar"
p2 <- ggplot(mtcars, aes(x = factor(cyl))) +
  stat_count() +
  labs(title = "stat_count() + geom_bar()")

# Specify the stat explicitly, or swap the default geom
p3 <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(stat = "count") +  # Explicitly specify
  labs(title = "Explicitly Specify Stat")

p4 <- ggplot(mtcars, aes(x = factor(cyl))) +
  stat_count(geom = "point", size = 4) +  # Change geom
  labs(title = "Display Count with Points")

(p1 | p2) / (p3 | p4)
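When the counts are already computed, the statistical transformation can be skipped altogether. As a sketch, geom_col() draws the values as given, and layer_data() confirms that both routes lead to the same bar heights (cyl_counts, p_bar and p_col are illustrative names):

```r
library(ggplot2)

# Let stat_count() do the counting
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Count first, then draw the values as they are with geom_col()
cyl_counts <- as.data.frame(table(cyl = mtcars$cyl))
p_col <- ggplot(cyl_counts, aes(x = cyl, y = Freq)) +
  geom_col()

# Both layers end up with the same bar heights
layer_data(p_bar)$count
layer_data(p_col)$y
```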

Component 4: Scales - Bridge from Data Space to Visual Space

Scales control how data values are mapped to visual properties. Every mapped aesthetic has a corresponding scale; ggplot2 adds a default one when none is specified.

# Understanding scale function
# Continuous scales
ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point(size = 3) +
  scale_color_gradient(name = "Horsepower",
                      low = "blue", high = "red",
                      breaks = c(100, 200, 300),
                      labels = c("Low", "Medium", "High")) +
  labs(title = "Continuous Scale: Horsepower Color Mapping")

# Discrete scales
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  scale_color_manual(name = "Cylinders",
                    values = c("4" = "#E69F00", "6" = "#56B4E9", "8" = "#CC79A7"),
                    labels = c("Four", "Six", "Eight")) +
  labs(title = "Discrete Scale: Cylinder Color Mapping")

Scale Transformations and Custom Breaks

# Logarithmic transformation
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  scale_x_log10(name = "Weight (Log Scale)", 
                labels = label_comma()) +
  scale_y_log10(name = "MPG (Log Scale)",
                labels = label_comma()) +
  labs(title = "Logarithmic Scale Transformation")

# Custom breaks and labels
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  scale_x_continuous(name = "Horsepower",
                    breaks = seq(50, 350, by = 50),
                    labels = paste0(seq(50, 350, by = 50), "HP")) +
  scale_y_continuous(name = "Fuel Efficiency",
                    breaks = seq(10, 35, by = 5),
                    labels = paste0(seq(10, 35, by = 5), "mpg")) +
  labs(title = "Custom Axis Labels and Breaks")

Component 5: Coordinate Systems - The Art of Positioning

Coordinate systems determine how the position aesthetics (x and y) are mapped to actual positions on the plot.

# Cartesian coordinate system (default)
p1 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(title = "Cartesian Coordinate System")

# Flipped coordinate system
p2 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Flipped Coordinate System")

# Polar coordinate system
p3 <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar() +
  coord_polar(theta = "x") +
  labs(title = "Polar Coordinate System - Coxcomb Chart")

# Fixed ratio coordinate system
p4 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  coord_fixed(ratio = 1) +
  labs(title = "Fixed Ratio Coordinate System")

(p1 | p2) / (p3 | p4)
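In ggplot2, a conventional pie chart is usually built as a single stacked bar drawn in polar coordinates, with the bar height mapped to the angle (theta = "y"). A minimal sketch, with pie as an illustrative name:

```r
library(ggplot2)

# One stacked bar (constant x), with bar height mapped to the angle
pie <- ggplot(mtcars, aes(x = "", fill = factor(cyl))) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(fill = "Cylinders", x = NULL, y = NULL)
pie
```

With theta = "x" instead, each bar keeps its own angular slot and the count sets the radius, which gives a Coxcomb (rose) chart rather than a pie.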

Component 6: Facets - The Power of Small Multiples

Faceting allows splitting data into subsets, with each subset displayed in a separate panel.

# facet_wrap: one-dimensional faceting
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ cyl, nrow = 2, 
             labeller = labeller(cyl = function(x) paste(x, "Cylinders"))) +
  labs(title = "facet_wrap: Faceted by Cylinder Count")

# facet_grid: two-dimensional faceting
p2 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(am ~ cyl, 
             labeller = labeller(am = c("0" = "Automatic", "1" = "Manual"),
                                cyl = function(x) paste(x, "Cyl"))) +
  labs(title = "facet_grid: Transmission Type × Cylinder Count")

p1 / p2
## `geom_smooth()` using formula = 'y ~ x'

Advanced Faceting Usage

# Free scales
ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ cyl, scales = "free", 
             labeller = labeller(cyl = function(x) paste("Cylinders:", x))) +
  labs(title = "Free Scale Faceting: Each Panel Has Independent Axis Ranges",
       x = "Displacement", y = "Fuel Efficiency")
## `geom_smooth()` using formula = 'y ~ x'

Component 7: Themes - The Final Mile of Visual Aesthetics

Themes control the non-data elements of the plot, such as fonts, colors, and lines.

# Base plot
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Different Theme Showcase",
       subtitle = "Same Data, Different Feel",
       x = "Weight (1000 lbs)", 
       y = "Fuel Efficiency (mpg)",
       color = "Cylinders")

# Built-in theme comparison
p1 <- base_plot + theme_gray() + labs(title = "theme_gray (default)")
p2 <- base_plot + theme_minimal() + labs(title = "theme_minimal")
p3 <- base_plot + theme_classic() + labs(title = "theme_classic")
p4 <- base_plot + theme_void() + labs(title = "theme_void")

(p1 | p2) / (p3 | p4)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Custom Themes

# Create custom theme
my_theme <- theme_minimal() +
  theme(
    # Title settings
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray50"),
    # Axis settings
    axis.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 10),
    # Legend settings
    legend.title = element_text(size = 12, face = "bold"),
    legend.text = element_text(size = 10),
    legend.position = "bottom",
    # Panel settings
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray90", linewidth = 0.5),
    # Background settings
    plot.background = element_rect(fill = "white", color = NA),
    panel.background = element_rect(fill = "gray98", color = NA)
  )

# Apply custom theme
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", se = TRUE, alpha = 0.3) +
  scale_color_viridis_d(name = "Cylinders") +
  labs(title = "Custom Theme Showcase",
       subtitle = "Professional-Grade Visualization Design",
       x = "Weight (1000 lbs)", 
       y = "Fuel Efficiency (mpg)") +
  my_theme
## `geom_smooth()` using formula = 'y ~ x'

Grammar of Graphics Practical Case: Layer-by-Layer Construction

Let’s demonstrate how to use Grammar of Graphics thinking to construct complex visualizations through a complete example.

# Step 1: Basic data layer
p <- ggplot(mtcars, aes(x = wt, y = mpg))

# Step 2: Add geometric objects
p <- p + geom_point(aes(color = factor(cyl), size = hp), alpha = 0.7)

# Step 3: Add statistical transformation
p <- p + geom_smooth(method = "lm", se = TRUE, color = "black", linetype = "dashed")

# Step 4: Customize scales
p <- p + 
  scale_color_manual(name = "Cylinders",
                    values = c("4" = "#E69F00", "6" = "#56B4E9", "8" = "#CC79A7")) +
  scale_size_continuous(name = "Horsepower", range = c(2, 8), 
                       breaks = c(100, 200, 300),
                       labels = c("100HP", "200HP", "300HP"))

# Step 5: Faceting
p <- p + facet_wrap(~ am, labeller = labeller(am = c("0" = "Automatic", "1" = "Manual")))

# Step 6: Add labels
p <- p + labs(
  title = "Grammar of Graphics in Practice: Car Performance Analysis",
  subtitle = "Exploring Relationships Between Weight, Fuel Efficiency, Power, and Transmission",
  x = "Weight (1000 lbs)",
  y = "Fuel Efficiency (mpg)",
  caption = "Data Source: mtcars dataset"
)

# Step 7: Apply theme
p <- p + theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray50"),
    strip.text = element_text(size = 11, face = "bold"),
    legend.position = "bottom"
  )

print(p)
## `geom_smooth()` using formula = 'y ~ x'

Advanced Grammar Concepts: Position Adjustments

Position adjustments solve the problem of overlapping geometric objects and are an important concept in Grammar of Graphics.

# Prepare data (seed fixed so the simulated values are reproducible)
set.seed(123)
df <- data.frame(
  category = rep(c("A", "B", "C"), each = 100),
  value = c(rnorm(100, 5, 1), rnorm(100, 7, 1.5), rnorm(100, 6, 2)),
  group = rep(c("Group1", "Group2"), 150)
)

# Summarise to one total per category and group; otherwise geom_col()
# draws one bar for each of the 300 rows and dodged bars overlap
df_total <- aggregate(value ~ category + group, data = df, FUN = sum)

# Different position adjustments
p1 <- ggplot(df_total, aes(x = category, y = value, fill = group)) +
  geom_col(position = "stack") +
  labs(title = "position = 'stack'")

p2 <- ggplot(df_total, aes(x = category, y = value, fill = group)) +
  geom_col(position = "dodge") +
  labs(title = "position = 'dodge'")

p3 <- ggplot(df_total, aes(x = category, y = value, fill = group)) +
  geom_col(position = "fill") +
  labs(title = "position = 'fill'")

p4 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point(position = position_jitter(width = 0.3, height = 0)) +
  labs(title = "position_jitter")

(p1 | p2) / (p3 | p4)
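Because jittering is random, the fourth panel changes slightly on every render. The following is a minimal sketch of how the adjustment can be made reproducible and inspected, assuming the `seed` argument of `position_jitter()`; the names `p_jitter`, `jittered`, and `x_offset` are introduced here only for illustration.

```r
library(ggplot2)

# Same jitter plot as above, but with a fixed seed so the random offsets are reproducible
p_jitter <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point(position = position_jitter(width = 0.3, height = 0, seed = 42))

# Positions after the adjustment has been applied
jittered <- layer_data(p_jitter)

# Horizontal offset of each point from the centre of its category
x_offset <- as.numeric(jittered$x) - round(as.numeric(jittered$x))
range(x_offset)                             # offsets are bounded by the chosen width
all(sort(jittered$y) == sort(mtcars$mpg))   # checks that y was left alone (height = 0)
```

Setting `height = 0` keeps the jitter horizontal, so the plotted mpg values remain the observed ones.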

Complete Grammar of Graphics Workflow

# Simulate a data science project visualization workflow
# 1. Data preparation
mtcars_extended <- mtcars %>%
  mutate(
    efficiency_category = case_when(
      mpg < 15 ~ "Low Efficiency",
      mpg < 25 ~ "Medium Efficiency", 
      TRUE ~ "High Efficiency"
    ),
    weight_category = case_when(
      wt < 2.5 ~ "Light",
      wt < 4 ~ "Medium",
      TRUE ~ "Heavy"
    )
  )

# 2. Exploratory visualization: Using Grammar of Graphics
final_plot <- ggplot(mtcars_extended, 
                    aes(x = wt, y = mpg)) +
  # Data layer 1: Background trend
  geom_smooth(method = "loess", se = TRUE, alpha = 0.2, color = "gray50") +
  # Data layer 2: Main data points
  geom_point(aes(color = efficiency_category, 
                size = hp,
                shape = factor(am)), 
            alpha = 0.8) +
  # Data layer 3: Category trend lines
  geom_smooth(aes(color = efficiency_category), 
             method = "lm", se = FALSE, linewidth = 1.2) +
  # Scale settings
  scale_color_manual(name = "Efficiency Level",
                    values = c("Low Efficiency" = "#d73027",
                             "Medium Efficiency" = "#fee08b",
                             "High Efficiency" = "#1a9850")) +
  scale_size_continuous(name = "Horsepower", range = c(2, 8)) +
  scale_shape_manual(name = "Transmission",
                    values = c("0" = 16, "1" = 17),
                    labels = c("Automatic", "Manual")) +
  # Faceting
  facet_wrap(~ weight_category, scales = "free_x",
             labeller = labeller(weight_category = function(x) paste("Weight Category:", x))) +
  # Labels
  labs(
    title = "Comprehensive Car Performance Analysis",
    subtitle = "Using Grammar of Graphics to Explore Complex Relationships Between Weight, Fuel Efficiency, Power, and Transmission",
    x = "Vehicle Weight (1000 lbs)",
    y = "Fuel Efficiency (miles per gallon)",
    caption = "Data visualization demonstrates the combined power of Grammar of Graphics\nEach visual element carries specific information"
  ) +
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray40"),
    plot.caption = element_text(size = 9, color = "gray50", hjust = 1),
    strip.text = element_text(size = 11, face = "bold"),
    legend.position = "bottom",
    legend.box = "horizontal",
    panel.grid.minor = element_blank()
  ) +
  guides(
    color = guide_legend(override.aes = list(size = 4)),
    size = guide_legend(override.aes = list(alpha = 1)),
    shape = guide_legend(override.aes = list(size = 4))
  )

print(final_plot)
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : span too small.  fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 4.0632
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 1.2818
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 0.032678
## Warning in sqrt(sum.squares/one.delta): NaNs produced
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.  fewer
## data values than degrees of freedom.
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 4.0632
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 1.2818
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 0.032678
## Warning in stats::qt(level/2 + 0.5, pred$df): NaNs produced
## `geom_smooth()` using formula = 'y ~ x'
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
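
The loess warnings above (`span too small`, `pseudoinverse used at 4.0632`) are consistent with at least one facet holding too few observations for a local regression. The sketch below recounts the cars per weight category in base R, reusing the cut points from the `case_when()` call. The names `weight_breaks` and `panel_sizes` are illustrative and not part of the course code.

```r
# Recreate the weight categories with base R (same cut points as the case_when() call).
# `weight_breaks` and `panel_sizes` are illustrative names.
weight_breaks <- c(-Inf, 2.5, 4, Inf)
weight_category <- cut(
  mtcars$wt,
  breaks = weight_breaks,
  right = FALSE,                       # intervals are [lower, upper), matching wt < 2.5 and wt < 4
  labels = c("Light", "Medium", "Heavy")
)

# Number of cars that end up in each facet panel
panel_sizes <- table(weight_category)
print(panel_sizes)
```

If a panel turns out to hold only a handful of points, one option is to use `method = "lm"` for the background trend as well, or to fit the smoother without faceting.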

Grammar of Graphics Learning Key Points Summary
  1. Mindset Shift: From “choosing chart types” to “combining grammar elements”
  2. Layering Concept: Each layer has a clear purpose and function
  3. Mapping vs Setting: Understanding when to use aes() and when not to
  4. Scale Importance: Each aesthetic mapping requires a corresponding scale
  5. Iterative Improvement: Grammar of Graphics supports gradual plot refinement
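
Key points 3 and 4 can be checked directly on a built plot. This is a minimal sketch rather than the only way to do it: the object names are introduced here for illustration, and `layer_data()` is used to read back the colours ggplot2 computed for each layer.

```r
library(ggplot2)

# Mapping: colour is driven by a data column, so ggplot2 builds a colour scale and a legend
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl)))

# Setting: colour is a fixed value outside aes(), so every point is drawn the same way
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "steelblue")

# Inspect the colours computed for each layer
mapped_colours <- unique(layer_data(p_mapped)$colour)
set_colours <- unique(layer_data(p_set)$colour)

length(mapped_colours)  # one colour per cylinder group
set_colours             # the single constant that was set
```

A common mistake is writing `aes(color = "steelblue")`, which maps the literal string as if it were data, so the points are not drawn in steel blue and a legend with a single entry appears.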

Basic Syntax Review and Hands-on Practice

After gaining a deep understanding of the theoretical foundation of Grammar of Graphics, let’s review the basic syntax structure:

# ggplot2 basic syntax structure
ggplot(data = <DATA>) +                    # 1. Specify data
  <GEOM_FUNCTION>(                         # 2. Choose geometric object
    mapping = aes(<MAPPINGS>),             # 3. Define aesthetic mapping
    stat = <STAT>,                         # 4. Statistical transformation (optional)
    position = <POSITION>                  # 5. Position adjustment (optional)
  ) +
  <SCALE_FUNCTION>() +                     # 6. Custom scales (optional)
  <COORDINATE_FUNCTION>() +                # 7. Coordinate system (optional)
  <FACET_FUNCTION>() +                     # 8. Faceting (optional)
  <THEME_FUNCTION>()                       # 9. Theme (optional)
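
As an illustration, here is one possible way to fill in every slot of the template with the built-in `mtcars` data. The particular geom, scale, and facet choices are arbitrary examples, and `p_template` is a name introduced only for this sketch.

```r
library(ggplot2)

# One concrete instance of the template, using the built-in mtcars data
p_template <- ggplot(data = mtcars) +                      # 1. data
  geom_bar(                                                # 2. geometric object
    mapping = aes(x = factor(cyl), fill = factor(am)),     # 3. aesthetic mapping
    stat = "count",                                        # 4. statistical transformation
    position = "dodge"                                     # 5. position adjustment
  ) +
  scale_fill_brewer(palette = "Set2", name = "am") +       # 6. scale
  coord_flip() +                                           # 7. coordinate system
  facet_wrap(~ gear) +                                     # 8. faceting
  theme_minimal()                                          # 9. theme

print(p_template)
```

Only the first three components are required; ggplot2 supplies defaults for the rest, so the same plot can be built up one optional component at a time.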

Quick Reference for Common Geometric Objects

[Figure: Common geom cases for one variable]
[Figure: Common geom cases for two variables]

Hands-on Practice: From Basic to Advanced

Practice 1: Progressive Construction of Scatter Plots

Let’s use Grammar of Graphics thinking to progressively build a scatter plot:

# Step 1: Only specify data and basic mapping
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Step 1: Basic Scatter Plot")

# Step 2: Add color mapping
p2 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl))) +
  labs(title = "Step 2: Add Color")

# Step 3: Add size mapping
p3 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl), size = hp)) +
  labs(title = "Step 3: Add Size")

# Step 4: Add shape and transparency
p4 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl), size = hp, shape = factor(am)), alpha = 0.7) +
  labs(title = "Step 4: Multiple Aesthetic Mappings")

(p1 | p2) / (p3 | p4)

Practice 2: Applying Grammar Components

# Apply all the components we've learned
ggplot(mtcars, aes(x = wt, y = mpg)) +
  # Layer 1: Scatter plot
  geom_point(aes(color = factor(cyl), size = hp), alpha = 0.6) +
  # Layer 2: Trend line
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  # Scale settings
  scale_color_brewer(name = "Cylinders", type = "qual", palette = "Set1") +
  scale_size_continuous(name = "Horsepower", range = c(2, 6)) +
  # Coordinate system
  coord_cartesian(xlim = c(1, 6), ylim = c(10, 35)) +
  # Labels
  labs(
    title = "Grammar of Graphics Practice",
    subtitle = "Composite Plot Integrating Multiple Components",
    x = "Weight (1000 lbs)",
    y = "Fuel Efficiency (mpg)",
    caption = "Demonstrates integration of data, aesthetic mappings, layers, scales, and coordinate systems"
  ) +
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom"
  )
## `geom_smooth()` using formula = 'y ~ x'

In-depth Exploration of Common Geometric Objects

After understanding the theoretical foundation of Grammar of Graphics, let’s explore the practical applications of various geometric objects:

Line Plots

# Line plot using economics dataset
ggplot(data = economics) + 
  geom_line(mapping = aes(x = date, y = unemploy))

Bar Charts
# Bar chart of car counts by cylinder
ggplot(data = mtcars) + 
  geom_bar(mapping = aes(x = factor(cyl), fill = factor(cyl)))

Histograms
# Histogram of mpg distribution
ggplot(data = mtcars) + 
  geom_histogram(mapping = aes(x = mpg), bins = 10, fill = "skyblue", color = "black")

Box Plots
# Box plot of mpg by cylinder
ggplot(data = mtcars) + 
  geom_boxplot(mapping = aes(x = factor(cyl), y = mpg, fill = factor(cyl)))

Statistical Transformations
# Adding smooth trend line
ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'

# Statistical summary
ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) + 
  geom_point(position = "jitter", alpha = 0.6) +
  stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
  stat_summary(fun.data = mean_se, geom = "errorbar", color = "red", width = 0.2)
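
To make the error bars less of a black box, the sketch below reproduces by hand what `mean_se()` returns for a single group: the mean plus and minus one standard error. The variable names are illustrative only.

```r
library(ggplot2)

# mpg values for the 4-cylinder cars
mpg_4cyl <- mtcars$mpg[mtcars$cyl == 4]

# What stat_summary(fun.data = mean_se) computes for this group
summary_ggplot <- mean_se(mpg_4cyl)

# The same quantities by hand: mean plus/minus one standard error
se <- sd(mpg_4cyl) / sqrt(length(mpg_4cyl))
summary_manual <- data.frame(
  y = mean(mpg_4cyl),
  ymin = mean(mpg_4cyl) - se,
  ymax = mean(mpg_4cyl) + se
)

print(summary_ggplot)
print(summary_manual)
```

The two data frames should agree, which shows that the red error bars span one standard error on each side of the group mean and are not confidence intervals.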

Coordinate Systems and Scales
# Coordinate transformation
ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() + 
  coord_flip()  # Flip x and y axes

# Custom scales
ggplot(data = mtcars, aes(x = wt, y = mpg, color = hp)) + 
  geom_point(size = 3) + 
  scale_color_gradient(low = "blue", high = "red") +
  scale_x_continuous(name = "Weight (1000 lbs)") +
  scale_y_continuous(name = "Miles per Gallon")

Faceting (Multiple Panels)
# Facet wrap
ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() + 
  facet_wrap(~ cyl, nrow = 2)

# Facet grid
ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() + 
  facet_grid(am ~ cyl, labeller = label_both)

Customizing Themes and Appearance
# Using built-in themes
ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) + 
  geom_point(size = 3) + 
  theme_minimal()

# Custom theme modifications
ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) + 
  geom_point(size = 3) + 
  labs(title = "Car Weight vs Fuel Efficiency",
       subtitle = "Relationship between weight and MPG by cylinder count",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon",
       color = "Cylinders",
       caption = "Data source: mtcars dataset") +
  theme_classic() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "gray50"),
    legend.position = "bottom",
    panel.grid.major = element_line(color = "gray90", linewidth = 0.5)
  )

Advanced Customization Techniques
Creating Professional Publication-Ready Plots
# Professional-looking plot with multiple layers
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point(aes(color = factor(cyl), size = hp), alpha = 0.7) + 
  geom_smooth(method = "lm", se = TRUE, color = "black", linetype = "dashed") +
  scale_color_manual(values = c("4" = "#E69F00", "6" = "#56B4E9", "8" = "#CC79A7"),
                     name = "Cylinders") +
  scale_size_continuous(name = "Horsepower", range = c(2, 6)) +
  labs(
    title = "Relationship Between Car Weight and Fuel Efficiency",
    subtitle = "Data points colored by cylinder count and sized by horsepower",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon (MPG)",
    caption = "Source: Motor Trend Car Road Tests (mtcars dataset)"
  ) +
  theme_bw() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray40"),
    plot.caption = element_text(size = 9, color = "gray50"),
    legend.position = "right",
    legend.box = "vertical",
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "gray90")
  )

print(p)
## `geom_smooth()` using formula = 'y ~ x'

Combining Multiple Plots
# Using patchwork package for combining plots (install if needed)
# install.packages("patchwork")
library(patchwork)

# Create individual plots
p1 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + 
  geom_boxplot(fill = "lightblue") + 
  labs(title = "MPG by Cylinders", x = "Cylinders", y = "MPG")

p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) + 
  geom_point(color = "red") + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "MPG vs Horsepower", x = "Horsepower", y = "MPG")

p3 <- ggplot(mtcars, aes(x = mpg)) + 
  geom_histogram(bins = 10, fill = "green", alpha = 0.7) +
  labs(title = "MPG Distribution", x = "MPG", y = "Count")

# Combine plots
(p1 | p2) / p3
## `geom_smooth()` using formula = 'y ~ x'

Practical Tips for Effective Plotting
  1. Start Simple: Begin with basic plots and add complexity gradually
  2. Choose Appropriate Geoms: Match the geometry to your data type
  3. Use Color Wisely: Ensure accessibility with colorblind-friendly palettes
  4. Label Everything: Always include informative titles, axis labels, and legends
  5. Maintain Consistency: Use consistent styling across related plots
  6. Consider Your Audience: Adjust complexity based on who will view the plot
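
Several of these tips can be combined in a few lines. The following is a small sketch, assuming the viridis scale bundled with ggplot2 is an acceptable colorblind-friendly choice; `old_theme` and `p_tips` are names introduced here for illustration.

```r
library(ggplot2)

# Tip 5: set one theme for the whole session so related plots share styling.
# theme_set() returns the previous theme, which makes it easy to restore later.
old_theme <- theme_set(theme_minimal(base_size = 12))

# Tip 3: the viridis scales are designed to stay distinguishable
# under common forms of colour blindness
p_tips <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  scale_color_viridis_d(name = "Cylinders") +
  # Tip 4: informative title, axis labels, and legend title
  labs(title = "Fuel Efficiency by Weight",
       x = "Weight (1000 lbs)",
       y = "Fuel Efficiency (mpg)")

print(p_tips)

# Restore the theme that was active before
theme_set(old_theme)
```

Setting the theme once at the top of a script avoids repeating the same `theme()` call on every plot.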

Useful ggplot2 Extensions

# Popular ggplot2 extension packages (install once; skip if already installed)
install.packages(c("ggthemes", "viridis", "plotly", "gganimate"))

# Example with ggthemes
library(ggthemes)
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) + 
  geom_point(size = 3) + 
  theme_economist() +
  scale_color_economist()

Version control

"Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. For the examples in this book you will use software source code as the files being version controlled, though in reality you can do this with nearly any type of file on a computer." (from the Pro Git book)

Installing git

Git setup

git config --global user.name "YOUR NAME"
git config --global user.email "your.email@address.org"
git config --global color.ui true
git config --global core.editor vim

# For windows users
git config --global core.quotepath off

Git basics

## Initializing a repository in an existing directory
# Go to the project's directory and type
git init

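# Add files you want to track
git add LICENSE
git add README.md
git commit -m 'First commit. Add LICENSE & README.md'

# Add new files
git add R.Rmd
git add helloworld.r
git commit -m 'Second commit. Add R.Rmd, helloworld.r'

# Link the local repository to a remote (replace <REMOTE_URL> with your repository URL)
git remote add origin <REMOTE_URL>
git push -u origin master

# Discard uncommitted changes and return to the last commit
git checkout -- filename   # restore a single file
git reset --hard           # restore every tracked file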


## Cloning an existing repository
git clone https://github.com/godkin1211/Rcourses.git

# Later, fetch and merge new changes (run inside the cloned directory)
git pull https://github.com/godkin1211/Rcourses.git

References